Influences on the hotel market#

Student names: Prashant Jawalapersad, Jonathan Ogbuli, Jon Hoogervorst, Jacob Jan Woord

Team number: J2

Link to online github repository: https://jacobjanwoord.github.io/Info-Vis_first/docs/home.html

Introduction#

Millions of people book a hotel online every day for various reasons, be it a business trip or just a relaxing vacation. These bookings play a big role in the success of hotels. We as a team are from the hotel market industry, and that is why we want to analyze what factors determine the hotel booking market.

In this data story we try to find the factors that determine the hotel booking market, and therefore also have a great influence on the market. Our analysis is based on two main perspectives: a comparison between city hotels and resorts, and the relationship between the number of air traffic passengers and hotel bookings. With these perspectives we hope to gain better insight into the correlations between the hotel booking market and these factors.

The analysis is based on datasets with a few well-correlated variables that provide good information on the hotel booking market. We use a hotel bookings dataset with information on the bookings made by tourists, a passenger flight dataset that contains passenger counts, a dataset about the amount of money a tourist spends during a stay, and a country dataset with economic information on a tourist’s home country. The variables we use from these datasets will become clear later in the story.

Perspective 1: During summer-time resorts perform better than city hotels in comparison with other months

In the first perspective we focus on performance by comparing city hotel statistics with those of resorts. We analyze three aspects of both: the average daily rate (ADR), which functions as an indicator of a hotel’s performance and profits; the duration of a stay (weekdays and weekends); and the number of bookings. Analyzing these three aspects is interesting for the market, because it allows hotels to adjust prices accordingly and to identify trends around the summer. We define summer-time as the months June, July and August.

Perspective 2: The hotel booking market is determined by the relative increase/decrease in flight passengers compared to previous months

The second perspective is about the relationship between hotel stay durations and flight passenger numbers. We check whether a higher number of air traffic passengers goes together with longer hotel stays (weekdays and weekends). This matters because the longer a tourist stays in the hotel, the easier it becomes to manage other bookings, with fewer check-outs to worry about.

With this data story we analyze, and more importantly visualize, data to gain insight into the hotel booking market.

Dataset and pre-processing#

Our first dataset is about hotel bookings. It contains data on over 100,000 bookings, including when the hotel was booked, how long the stay was, the country of the person booking the hotel and much more. We split the dataset by hotel type, giving one dataset for resorts and one for city hotels. We then grouped each dataset by arrival month and year and aggregated it with the mean function. After this we added a Month_Year column and a Hotel Type column. The Month_Year column makes the plots easier to build, and the Hotel Type column restores information that was lost while aggregating the datasets.

We combined the original hotel booking dataset with the Air traffic passengers dataset, which tracks data such as operating airline, terminal and passenger count. Both datasets were aggregated per month using the mean function, which allowed us to merge them on month.
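The month-level merge described above can be sketched on a tiny example. This is only an illustration under stated assumptions: the helper name `merge_on_month` and all values are invented for the example, while the column names `arrival_date_month`, `Month` and `Passenger Count` follow the two datasets. The `indicator=True` option of pandas' `merge` adds a `_merge` column, which makes it easy to spot months that found no air traffic counterpart.

```python
import pandas as pd


def merge_on_month(hotel_monthly: pd.DataFrame, air_monthly: pd.DataFrame) -> pd.DataFrame:
    """Left-merge monthly hotel means with monthly air traffic means.

    hotel_monthly is indexed by 'arrival_date_month', air_monthly by 'Month'.
    The '_merge' column reports whether each hotel month found a match.
    """
    merged = hotel_monthly.reset_index().merge(
        air_monthly.reset_index(),
        how="left",
        left_on="arrival_date_month",
        right_on="Month",
        indicator=True,
    )
    return merged.set_index("arrival_date_month")


# Toy values, invented for illustration only (not taken from the real data).
hotel_monthly = pd.DataFrame(
    {"adr": [70.0, 140.0]},
    index=pd.Index(["January", "July"], name="arrival_date_month"),
)
air_monthly = pd.DataFrame(
    {"Passenger Count": [25000.0]},
    index=pd.Index(["July"], name="Month"),
)

merged = merge_on_month(hotel_monthly, air_monthly)

# Months of the hotel data without an air traffic row
unmatched = merged.index[merged["_merge"] == "left_only"].tolist()
print(unmatched)
```

If the list of unmatched months is not empty on the real data, the month labels of the two datasets probably differ in spelling or format and should be harmonised before merging.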

These datasets were found on Kaggle. Here are the links:
Hotel bookings - https://www.kaggle.com/datasets/mojtaba142/hotel-booking
Air traffic passengers - https://www.kaggle.com/datasets/thedevastator/airlines-traffic-passenger-statistics

Preprocessing code#

# imports
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
# get the data from the files.
hotel_data = pd.read_table("hotel_booking.csv", delimiter=";")
air_data = pd.read_table("Air_Traffic_Passenger_Statistics.csv", delimiter=";")

Preprocessing Perspective 1

# split hotel data in resort hotels and city hotels
resort_hotels = hotel_data[hotel_data['hotel'] == 'Resort Hotel']
city_hotels = hotel_data[hotel_data['hotel'] == 'City Hotel']

# group resort hotels based on month and year
grouped_data_resort = resort_hotels.groupby(['arrival_date_month','arrival_date_year'])
resort_aggregate = grouped_data_resort.aggregate({
    'is_canceled': 'mean',
    'lead_time': 'mean',
    'is_repeated_guest': 'mean',
    'previous_cancellations': 'mean',
    'previous_bookings_not_canceled': 'mean',
    'booking_changes': 'mean',
    'stays_in_weekend_nights': 'mean',
    'stays_in_week_nights': 'mean',
    'adults': 'mean',
    'children': 'mean',
    'babies': 'mean',
    'adr': 'mean',
    'required_car_parking_spaces': 'mean'
})

# group city hotels based on month and year
grouped_data_city = city_hotels.groupby(['arrival_date_month','arrival_date_year'])
city_aggregate = grouped_data_city.aggregate({
    'is_canceled': 'mean',
    'lead_time': 'mean',
    'is_repeated_guest': 'mean',
    'previous_cancellations': 'mean',
    'previous_bookings_not_canceled': 'mean',
    'booking_changes': 'mean',
    'stays_in_weekend_nights': 'mean',
    'stays_in_week_nights': 'mean',
    'adults': 'mean',
    'children': 'mean',
    'babies': 'mean',
    'adr': 'mean',
    'required_car_parking_spaces': 'mean'
})

# add Month_Year column to datasets 
resort_aggregate['Month_Year'] = resort_aggregate.index.get_level_values(0) + ' ' + resort_aggregate.index.get_level_values(1).astype(str)
city_aggregate['Month_Year'] = city_aggregate.index.get_level_values(0) + ' ' + city_aggregate.index.get_level_values(1).astype(str)

# order of the months for plots
month_year_order = ['August 2015', 'September 2015', 'October 2015',
                    'November 2015', 'December 2015', 'January 2016',
                    'February 2016', 'March 2016', 'April 2016',
                    'May 2016', 'June 2016', 'July 2016',
                    'August 2016', 'September 2016', 'October 2016',
                    'November 2016', 'December 2016', 'January 2017',
                    'February 2017', 'March 2017', 'April 2017',
                    'May 2017', 'June 2017', 'July 2017',
                    'August 2017']

# order datasets based on Month_Year column
resort_aggregate['Month_Year'] = pd.Categorical(resort_aggregate['Month_Year'], categories=month_year_order, ordered=True)
rdf_sorted = resort_aggregate.sort_values('Month_Year')

city_aggregate['Month_Year'] = pd.Categorical(city_aggregate['Month_Year'], categories=month_year_order, ordered=True)
cdf_sorted = city_aggregate.sort_values('Month_Year')

# add hotel type column to datasets and combine them
resort_aggregate['Hotel Type'] = 'Resort Hotel'
city_aggregate['Hotel Type'] = 'City Hotel'
combined_data = pd.concat([city_aggregate, resort_aggregate])

Preprocessing Perspective 2

# group and aggregate the hotel booking dataset based on month using the mean function
aggregate_hotel_month = hotel_data.groupby("arrival_date_month").aggregate({
    'is_canceled': 'mean',
    'lead_time': 'mean',
    'is_repeated_guest': 'mean',
    'previous_cancellations': 'mean',
    'previous_bookings_not_canceled': 'mean',
    'booking_changes': 'mean',
    'stays_in_weekend_nights': 'mean',
    'stays_in_week_nights': 'mean',
    'adults': 'mean',
    'children': 'mean',
    'babies': 'mean',
    'adr': 'mean',
    'required_car_parking_spaces': 'mean'
})

# group and aggregate the air traffic dataset based on month using the mean function
aggregate_air_month = air_data.groupby("Month").aggregate({
    'Passenger Count': 'mean',
    'Adjusted Passenger Count': 'mean'
})

# merge hotel booking data with air traffic data on the month name
concat_air = aggregate_hotel_month.reset_index().merge(aggregate_air_month,how='left', left_on='arrival_date_month', right_on='Month').set_index('arrival_date_month')

# month names in calendar order
month_names = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# groupby sorts the month names alphabetically, so put the rows in calendar order;
# the plots below pair these rows with month_names by position
concat_air = concat_air.reindex(month_names)

Perspective 1:#

During summer-time resorts perform better than city hotels in comparison with other months

  • Argument 1:
    One prominent argument supporting the perspective that resort hotels outperform city hotels during the summer is the guests’ length of stay: the longer the stay, the better. The figure below illustrates the average length of stay per month across multiple years.

# Visualization 1: length of stay per month of resort and city hotels for weekend and week nights
# colors for plot
colors = px.colors.qualitative.T10

# city hotel weekend stay trace
trace1 = go.Bar(
    x=cdf_sorted['Month_Year'],
    y=cdf_sorted['stays_in_weekend_nights'],
    name='City Hotel (weekend)',
    marker=dict(color=colors[0])
)

# resort hotel weekend stay trace
trace2 = go.Bar(
    x=rdf_sorted['Month_Year'],
    y=rdf_sorted['stays_in_weekend_nights'],
    name='Resort Hotel (weekend)',
    marker=dict(color=colors[1])
)

# city hotel week stay trace
trace3 = go.Bar(
    x=cdf_sorted['Month_Year'],
    y=cdf_sorted['stays_in_week_nights'],
    name='City Hotel (week)',
    marker=dict(color=colors[2])
)

# resort hotel week stay trace
trace4 = go.Bar(
    x=rdf_sorted['Month_Year'],
    y=rdf_sorted['stays_in_week_nights'],
    name='Resort Hotel (week)',
    marker=dict(color=colors[3])
)

# combine traces
data = [trace1, trace2, trace3, trace4]

# layout of the plot
layout = go.Layout(
    title='<b>Visualization 1: Comparing the amount of nights spent at a City<br> and a Resort hotel</b>',
    xaxis=go.layout.XAxis(
        title='<b>Month and Year</b>',
        type='category'
    ),
    yaxis=go.layout.YAxis(
        title='<b>Amount of weekend nights</b>',
    ),
    barmode='group',

    # choice between weekend and week traces
    updatemenus=[ # used ChatGPT to make it interactive
        dict(
            buttons=list([
                dict(
                    args=[{'visible': [True, True, False, False]}, {'yaxis.title': '<b>Amount of weekend nights</b>'}],
                    label='Weekends',
                    method='update'
                ),
                dict(
                    args=[{'visible': [False, False, True, True]}, {'yaxis.title': '<b>Amount of week nights</b>'}],
                    label='Weekdays',
                    method='update'
                )
            ]),
            direction='down',
            showactive=True,
            x=.08,
            y=1.15
        )
    ], font=dict(size= 10)
)

data[2].visible = False
data[3].visible = False

figure = go.Figure(data=data, layout=layout)
figure.update_layout(
    title={'text':'<b>Visualization 1: Comparing the amount of nights spent at a City and a Resort hotel</b>',
    'y':0.95,
    'x':0.5,
    'xanchor': 'center',
    'yanchor': 'top'})
figure.show()

*Figure 1: The x-axis of this visualization shows the months and their corresponding year. The y-axis shows the average number of nights guests spent at a City or Resort hotel during the week or weekend. During the summer months, the average length of stay at the Resort hotel increases substantially, while the length of stay at the City hotel fluctuates much less.*

As shown above, the data reveals a notable increase in the length of guest stays at resorts during the summer months compared to other seasons. This can be attributed to several factors, such as summer vacations, school breaks, and a higher demand for leisure activities. The extended duration of stays positively impacts the performance of resorts, contributing to their superior performance during this time.
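The summer effect can also be expressed as a number. The sketch below is a hedged illustration rather than part of our analysis: the helper `summer_vs_rest` and the toy values are invented, and only the column names `hotel`, `arrival_date_month` and `stays_in_week_nights` come from the hotel bookings dataset. It compares the mean of a column in the summer months with the mean over the rest of the year, per hotel type.

```python
import pandas as pd

SUMMER_MONTHS = ["June", "July", "August"]  # summer as defined in the introduction


def summer_vs_rest(bookings: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Mean of value_col per hotel type, for summer and for the rest of the year."""
    season = (
        bookings["arrival_date_month"]
        .isin(SUMMER_MONTHS)
        .map({True: "summer", False: "rest of year"})
        .rename("season")
    )
    table = bookings.groupby(["hotel", season])[value_col].mean().unstack("season")
    # A ratio above 1 means longer stays (or higher values) in summer
    table["summer / rest"] = table["summer"] / table["rest of year"]
    return table


# Toy bookings, invented for illustration only.
bookings = pd.DataFrame({
    "hotel": ["Resort Hotel"] * 4 + ["City Hotel"] * 4,
    "arrival_date_month": ["July", "August", "January", "March"] * 2,
    "stays_in_week_nights": [5, 5, 2, 3, 2, 2, 2, 2],
})

table = summer_vs_rest(bookings, "stays_in_week_nights")
print(table)
```

Applied to the real `hotel_data`, a clearly higher ratio for resorts than for city hotels would back up what Figure 1 shows visually. The same helper could be called with `adr` as the value column.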

  • Argument 2:
    The Average Daily Rate (ADR) functions as an indicator of a hotel’s performance and profits. ADR helps hotel owners gain insight into the average rate of room sales over a period of time.

    Another compelling argument supporting the perspective that resorts excel during the summer is the substantial difference in the ADR between resorts and city hotels. The figures below present the ADR comparison between the two types of accommodations.

# Visualization 2: ADR for city and resort hotel, also per month
# box plot of adr for city and resort hotels
fig = px.box(combined_data, x='Hotel Type', y='adr', title='<b>Visualization 2: Comparison of ADR for a City and a Resort Hotel</b>')
fig.update_yaxes(
    title='<b>Average Daily Rate in euros</b>', secondary_y=False
)
fig.update_xaxes(title='<b>Hotel Type</b>')
fig.show()

*Figure 2: The x-axis shows the hotel type and the y-axis shows the ADR in euros. In this boxplot you can see that the ADR for the city hotel stays consistent with minimal outliers. The resort hotel’s box is bigger, with a high maximum and a low minimum, indicating larger fluctuation.*

# ADR for city hotels per month
trace1 = go.Bar(
    x = cdf_sorted['Month_Year'],
    y = cdf_sorted['adr'],
    name='City Hotel',
    marker=dict(color='rgb(102,194,165)')
)

# ADR for resort hotels per month
trace2 = go.Bar(
    x = rdf_sorted['Month_Year'],
    y = rdf_sorted['adr'],
    name='Resort Hotel',
    marker=dict(color='rgb(255, 141, 98)')
)

# combine traces
data = [trace1, trace2]

# set layout of bar plot
layout = go.Layout(
    title='<b>Visualization 2: Comparison of ADR for a City and a Resort Hotel per month</b>',
    xaxis=go.layout.XAxis(
        title='<b>Month and Year</b>',
        type='category'
    ),
    yaxis=go.layout.YAxis(
        title='<b>Average Daily Rate in euros</b>',
    ),
    barmode='group',
)

go.Figure(data=data, layout=layout).show()

*Figure 3: The x-axis shows the months and years, and the y-axis shows the ADR. In this bar chart you can see that the high maxima from the boxplot stem from the summer seasons.*

Both figures clearly demonstrate that resorts command a significantly higher ADR during the summer months compared to city hotels. This suggests that guests are willing to spend more on accommodations when visiting resorts, indicating a higher level of demand and attractiveness. The increased ADR contributes to the financial success of resorts during the summer season.

  • Argument 3:
    One of the arguments supporting the perspective is that resort hotels receive relatively more bookings during summer months than other months compared to city hotels. This argument can be supported by a violin plot, shown below:

# Visualization 3: Total number of bookings per month per type of hotel
# month order for plot
month_order = ['December', 'November', 'October', 'September', 'August', 'July', 'June', 'May', 'April', 'March', 'February', 'January']
fig2 = go.Figure()

hotels = ['Resort Hotel', 'City Hotel']

# add trace for both hotels
for hotel in hotels:
    fig2.add_trace(
        go.Violin(
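            # filter x and y with the same mask so each trace only holds its own hotel's bookings
            x = hotel_data.loc[hotel_data['hotel'] == hotel, 'hotel'],
            y = hotel_data.loc[hotel_data['hotel'] == hotel, 'arrival_date_month'],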
            name = hotel,
            box_visible = True,
            meanline_visible = True,
            bandwidth=0.5,
        )
    )

fig2.update_layout(yaxis=dict(categoryorder='array', categoryarray=month_order), title = 'Visualization 3: Number of bookings in both hotel types through the months')
fig2.show()

*Figure 4: This visualization shows the distribution of the number of stays for both hotel types over all months. The y-axis contains the months, and the x-axis shows the hotel type, with the width of each violin representing the number of stays. What stands out is that the city hotel grows and shrinks relatively calmly through the year, while the resort hotel has a more sudden spike during the summer months.*

This is plausible because people are more likely to go to a resort during the summer, since resorts are almost exclusively used for vacation purposes. City hotels, on the other hand, can be visited for a number of reasons other than vacation, such as business trips or layovers. Because of this, city hotels are used more evenly throughout the year, while resort hotels are used more during the summer.
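To make "relatively more bookings during summer" concrete, one could compute the share of each hotel type's bookings that fall in the summer months. The following is a minimal sketch under assumptions: `summer_booking_share` is a hypothetical helper and the toy rows are invented; only the column names `hotel` and `arrival_date_month` come from the dataset.

```python
import pandas as pd


def summer_booking_share(bookings: pd.DataFrame,
                         summer_months=("June", "July", "August")) -> pd.Series:
    """Fraction of each hotel type's bookings that arrive in the summer months."""
    is_summer = bookings["arrival_date_month"].isin(list(summer_months))
    # The mean of a boolean column is the fraction of True values
    return is_summer.groupby(bookings["hotel"]).mean()


# Toy bookings, invented for illustration only.
bookings = pd.DataFrame({
    "hotel": ["Resort Hotel"] * 4 + ["City Hotel"] * 4,
    "arrival_date_month": ["July", "August", "July", "January",
                           "July", "January", "March", "October"],
})

share = summer_booking_share(bookings)
print(share)
```

As a reference point, bookings spread evenly over the year would give a summer share of 3/12 = 0.25, so a share well above that for resorts would support the argument.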

Perspective 2:#

The hotel booking market is determined by the relative increase/decrease of air traffic passenger count compared to previous months.
This perspective is about the relationship between hotel stay durations and flight passenger numbers. We check whether a higher number of air traffic passengers goes together with longer hotel stays (weekdays and weekends). This matters because the longer a tourist stays in the hotel, the easier it becomes to manage other bookings, with fewer check-outs to worry about.
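Since the perspective speaks of a relative increase or decrease compared to the previous month, the month-over-month change can be sketched as follows. This is an illustration only: the helpers `relative_change` and `direction_agreement` and the toy values are assumptions made for the example, not part of our preprocessing.

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def relative_change(monthly: pd.Series) -> pd.Series:
    """Relative change versus the previous calendar month (first month is NaN)."""
    ordered = monthly.reindex([m for m in MONTHS if m in monthly.index])
    return ordered.pct_change(fill_method=None)


def direction_agreement(a: pd.Series, b: pd.Series) -> float:
    """Share of months in which both series rise, or both do not rise."""
    change_a, change_b = relative_change(a).dropna().align(
        relative_change(b).dropna(), join="inner"
    )
    return float(((change_a > 0) == (change_b > 0)).mean())


# Toy monthly means, invented for illustration only.
passengers = pd.Series({"March": 99.0, "January": 100.0, "February": 110.0})
stays = pd.Series({"January": 3.0, "February": 3.3, "March": 3.0})

print(relative_change(passengers))
print(direction_agreement(passengers, stays))
```

On the real data the two series would be `concat_air['Passenger Count']` and, for example, `concat_air['stays_in_week_nights']`. Because the monthly means pool several years, the step from December back to January is not covered by this sketch.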

  • Argument 1:
    The first argument concerns the correlation between air traffic passengers and the duration of a hotel stay. It supports the perspective that the hotel booking market is influenced by air traffic passenger count. Visualization 4 (Figure 5) below illustrates the relationship between the number of air traffic passengers and the average length of hotel stays.

# Visualization 4: Passenger Count vs stays_in_week_nights and stays_in_weekend_nights per month
# passenger count bar
trace_pass = go.Bar(
    x = month_names,
    y = concat_air['Passenger Count'],
    name='Passenger Count'
)

# stays in week nights line
trace_week = go.Scatter(
    x = month_names,
    y = concat_air['stays_in_week_nights'],
    name='Stay in week nights', line=dict(width=4), marker=dict(size=12)
)

# stays in weekend nights line
trace_end = go.Scatter(
    x = month_names,
    y = concat_air['stays_in_weekend_nights'],
    name='Stay in weekend nights', line=dict(width=4), marker=dict(size=12, color='orange')
)

# second y, right side
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(trace_pass, secondary_y=False)
fig.add_trace(trace_week, secondary_y=True)
fig.add_trace(trace_end, secondary_y=True)

# add title
fig.update_layout(
    title_text="<b>Visualization 4: Analysis of airline passenger count and length of hotel stay</b>", 
    legend=dict(font=dict(size= 10), y=1, x =1.05)
)

# x-axis title
fig.update_xaxes(title_text="<b>Month</b>")

# y-axes titles
fig.update_yaxes(title_text="<b>Passenger Count</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Average length of hotel booking</b>", secondary_y=True)

fig.show()

*Figure 5: The x-axis of this visualization shows the months. The left y-axis shows the average air traffic passenger count (blue bars) and the right y-axis shows the number of nights per stay for week nights (red line) and weekend nights (orange line). All three variables go up and down in roughly the same months: months with a higher passenger count also have longer stays.*

Passenger count is the average air traffic passenger counts per month. Stays in week nights and stays in weekend nights show how long the hotel bookings are.

Figure 5 showcases a clear correlation between the increase or decrease in air traffic passenger count and the corresponding change in the average length of hotel stays. When air traffic passengers increase, indicating higher travel activity, the length of hotel stays also tends to increase. Conversely, a decrease in air traffic passengers is associated with shorter stays. This correlation suggests that the hotel booking market is linked to the number of air travel passengers.
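The visual impression from Figure 5 could be backed by a correlation coefficient. The sketch below is a hedged example: `correlation_with_passengers` is a hypothetical helper and the toy numbers are invented (chosen to show both a positive and a negative sign), while `corrwith` is pandas' own method for correlating several columns with one series.

```python
import pandas as pd


def correlation_with_passengers(monthly: pd.DataFrame, columns,
                                passenger_col: str = "Passenger Count") -> pd.Series:
    """Pearson correlation of each column with the monthly passenger count."""
    return monthly[columns].corrwith(monthly[passenger_col])


# Toy monthly means, invented for illustration only.
monthly = pd.DataFrame({
    "Passenger Count": [100.0, 200.0, 300.0, 400.0],
    "stays_in_week_nights": [2.0, 2.5, 3.0, 3.5],
    "stays_in_weekend_nights": [1.0, 0.9, 0.8, 0.7],
})

result = correlation_with_passengers(
    monthly, ["stays_in_week_nights", "stays_in_weekend_nights"]
)
print(result)
```

On the real data, `monthly` would be `concat_air`, and columns such as `adr` or `lead_time` could be passed as well. With only twelve monthly points the coefficient is a rough indication, and a high value does not show that air traffic causes longer stays.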

  • Argument 2:
    People tend to spend more in months with more air traffic passengers. The average daily rate (ADR), which indicates how busy hotels are and how much guests spend, is higher in months that have more air traffic passengers. Visualization 5 (Figure 6) below presents the ADR trend in relation to the number of air traffic passengers.

# Visualization 5: passenger count vs ADR per month
# passenger count bar
trace_pass = go.Bar(
    x = month_names,
    y = concat_air['Passenger Count'],
    name='Passenger Count'
)

# adr box per month; bookings with an adr of 4000 or more are dropped as extreme outliers
adr_data = hotel_data[hotel_data['adr'] < 4000]
trace_adr = go.Box(x=adr_data["arrival_date_month"], y=adr_data["adr"], boxpoints=False, name="Average Daily Rate (€)")

# second y, right side
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(trace_pass, secondary_y=False)
fig.add_trace(trace_adr, secondary_y=True)

# add title
fig.update_layout(
    title_text="<b>Visualization 5: Analysis of airline passenger count and ADR</b>"
)

# x-axis title
fig.update_xaxes(title_text="<b>Month</b>")
fig.update_xaxes(categoryorder='array', categoryarray=month_names)


# y-axes titles
fig.update_yaxes(title_text="<b>Passenger count</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>ADR</b>", secondary_y=True)

fig.show()

*Figure 6: This visualization shows that months with a higher passenger count also have a higher ADR, so hotel bookings are more expensive in months where the air traffic passenger count is higher. The x-axis shows the months. The left y-axis shows the passenger count and the right y-axis shows the ADR.*

Figure 6 indicates an association between the number of air traffic passengers and the Average Daily Rate (ADR) in the hotel booking market. Months with a higher count of air traffic passengers tend to show an increased ADR, suggesting that there is a higher demand for hotel accommodations during periods of increased air travel. This relation emphasizes the impact of air traffic on consumer behaviour and on the dynamics of the market overall.

  • Argument 3:
    The last argument supporting the perspective is that in months where passenger count is higher, bookings for that month tend to have a higher average lead time. A higher lead time is beneficial for hotels, because it helps them make sure they are fully booked. Visualization 6 (Figure 7) below illustrates this relationship.

# Visualization 6: passenger count vs lead time per month
# create scatter plot passenger count and month, color is based on lead time
fig = px.scatter(
    concat_air, 
    y=concat_air.index, 
    x="Passenger Count", color='lead_time', 
    title='<b>Visualization 6: Analysis of Passenger count vs lead time per month</b>'
)
fig.update_yaxes(categoryorder='array', categoryarray=month_order)

# styling code
fig.update_traces(marker_size=15)
fig.update_layout(
    yaxis_title="<b>Month</b>", 
    xaxis_title="<b>Passenger Count</b>", 
    coloraxis_colorbar=dict(
        title="<b>Lead Time</b> (days)",
        thicknessmode="pixels",
        yanchor="top", y=1,
        ticks="outside",
        dtick=10
    )
)

fig.show()

*Figure 7: This visualization shows that in months (y-axis) with a higher passenger count (x-axis) the average lead time (color) increases. Lead time is the time between the booking and the arrival date, measured in days. We can conclude that it is higher in months with a higher passenger count.*

Figure 7 demonstrates a positive correlation between the number of air traffic passengers and the average lead time of bookings. In months when the passenger count is higher, there is an increased duration between the booking date and the actual stay dates. This longer lead time allows hotels to plan their resources more effectively, ensuring their rooms are fully booked and maximizing their revenue potential. The relationship between air traffic passenger count and average lead time highlights the influence of passenger count on the hotel booking market.

Reflection#

During the making of this project, we had a moment to reflect on our data story, in which we received a lot of critique that we could use to improve it. The feedback came from our peers and our teaching assistant. The peer feedback made clear which points of our data story were confusing and what we did well. Visually we were doing quite well, but the story did not flow well and the structure was also not good.

The story did not flow well because our perspectives were too broad. Our teaching assistant suggested cutting at least one perspective to make the structure of the story clearer. With this advice in mind we decided to cut two of our perspectives and add a fresh one to our story.

Our peers suggested clarifying the arguments, because they were not very understandable at first sight. We also addressed a few other points in our plots, such as changing the variable order and changing values.

Work Distribution#

Jonathan took the lead on perspective 1 and was responsible for making two visualizations with arguments.
Jon also contributed to perspective 1 by making another visualization, and took on the role of task management. He ensured that deadlines were met and that progress was on track.
Jacob Jan played a huge role in preprocessing the datasets and refining arguments. He was also responsible for making perspective 2 and uploading the final project to the Jupyter Book git page.
Prashant had a multifaceted role in the team and helped with visualizations and arguments when help was needed. He was also responsible for a few of the written sections of the project, such as the introduction, reflection and work distribution.

Appendix#

Generative AI (ChatGPT with GPT 3.5) is used to facilitate the creation of this document, as shown in the table below.

| Reasons of Usage | In which parts? | Which prompts were used? |
| --- | --- | --- |
| Brainstorming ideas | All written sections | “Give ideas to write about this certain topic” |
| Enhance readability | All written sections | “Revise the paragraph and check this text for spelling mistakes.” |
| Improving datastory | All written sections | “Support this argument” |

*Table 1: Usage of generative AI to facilitate the creation of this document.*